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Abstract 


We introduce a new machine learning approach for image segmentation that uses a 
neural network to model the conditional energy of a segmentation given an image. 
Our approach, combinatorial energy learning for image segmentation (CELIS) 
places a particular emphasis on modeling the inherent combinatorial nature of 
dense image segmentation problems. We propose efficient algorithms for learning 
deep neural networks to model the energy function, and for local optimization of 
this energy in the space of supervoxel agglomerations. We extensively evaluate 
our method on a publicly available 3-D microscopy dataset with 25 billion voxels 
of ground truth data. On an 11 billion voxel test set, we find that our method 
improves volumetric reconstruction accuracy by more than 20% as compared to 
two state-of-the-art baseline methods: graph-based segmentation of the output 
of a 3-D convolutional neural network trained to predict boundaries, as well as a 
random forest classifier trained to agglomerate supervoxels that were generated by 
a 3-D convolutional neural network. 


1 Introduction 

Mapping neuroanatomy, in the pursuit of linking hypothesized computational models consistent 
with observed functions to the actual physical structures, is a long-standing fundamental problem 
in neuroscience. One primary interest is in mapping the network structure of neural circuits by 
identifying the morphology of each neuron and the locations of synaptic connections between 
neurons, a field called connectomics. Currently, the most promising approach for obtaining such 
maps of neural circuit structure is volume electron microscopy of a stained and fixed block of 
tissue. [4, 18, 19, 12] This technique was first used successfully decades ago in mapping the structure 
of the complete nervous system of the 302-neuron Caenorhabditis elegans; due to the need to 
manually cut, image, align, and trace all neuronal processes in about 8000 50 nm serial sections, even 
this small circuit required over 10 years of labor, much of it spent on image analysis. [33] At the time, 
scaling this approach to larger circuits was not practical. 

Recent advances in volume electron microscopy [13, 22, 17] make feasible the imaging of large 
circuits, potentially containing hundreds of thousands of neurons, at sufficient resolution to discern 
even the smallest neuronal processes. [4, 18, 19, 12] The high image quality and near-isotropic 
resolution achievable with these methods enables the resultant data to be treated as a true 3-D volume, 
which significantly aids reconstruction of processes that do not run parallel to the sectioning axis, and 
is potentially more amenable to automated image processing. 
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Figure 1: Illustration of computation of global energy for a single candidate segmentation S. The 
local energy Es{x;S; I) G [0,1], computed by a deep neural network, is summed over all shape 
descriptor types s and voxel positions x. 


Image analysis remains a key challenge, however. The primary bottleneck is in segmenting the 
full volume, which is filled almost entirely by heavily intertwined neuronal processes, into the 
volumes occupied by each individual neuron. While the cell boundaries shown by the stain provide 
a strong visual cue in most cases, neurons can extend for tens of centimeters in path length while 
in some places becoming as narrow as 40 nm; a single mistake anywhere along the path can render 
connectivity information for the neuron largely inaccurate. Existing automated and semi-automated 
segmentation methods do not sufficiently reduce the amount of human labor required: a recent 
reconstruction of 950 neurons in the mouse retina required over 20000 hours of human labor, even 
with an efficient method of tracing just a skeleton of each neuron [20]; a recent reconstruction of 
379 neurons in the Drosophila medulla column (part of the visual pathway) required 12940 hours of 
manual proof-reading/correction of an automated segmentation [28]. 

Related work: Algorithmic approaches to image segmentation are often formulated as variations on 
the following pipeline: a boundary detection step establishes local hypotheses of object boundaries, a 
region formation step integrates boundary evidence into local regions (i.e. superpixels or supervoxels), 
and a region agglomeration step merges adjacent regions based on image and object features. [ i , 21, 
32, 2] Although extensive integration of machine learning into such pipelines has begun to yield 
promising segmentation results [3, 16, 24], we argue that such pipelines, as previously formulated, 
fundamentally neglect two potentially important aspects of achieving accurate segmentation: (i) the 
combinatorial nature of reasoning about dense image segmentation structure,^ and (ii) the fundamental 
importance of shape as a criterion for segmentation quality. 

Contributions: We propose a method that attempts to overcome these deficiencies. In particular, 
we propose an energy-based model that scores segmentation quality using a deep neural network 
that flexibly integrates shape and image information: Combinatorial Energy Learning for Image 
Segmentation (CELIS). In pursuit of such a model this paper makes several specific contributions: 
a novel connectivity region data structure for efficiently computing the energy of configurations of 
3-D objects; a binary shape descriptor for efficient representation of 3-D shape configurations; a 
neural network architecture that splices the intermediate unit output from a trained convolutional 
network as input to a deep fully-connected neural network architecture that scores a segmentation 
and 3-D image; a training procedure that uses pairwise object relations within a segmentation to 
learn the energy-based model, an experimental evaluation of the proposed and baseline automated 
reconstruction methods on a massive and (to our knowledge) unprecedented scale that reflects the 
true size of connectomic datasets required for biological analysis (many billions of voxels). 

2 Conditional energy modeling of segmentations given images 

We define a global, translation-invariant energy model for predicting the cost of a complete seg¬ 
mentation S given a corresponding image /. This cost can be seen as analogous to the negative 


^ While prior work [32, 16, 2] has recognized the importance of combinatorial reasoning, the previously 
proposed global optimization methods allow local decisions to interact only in a very limited way. 
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log-likelihood of the segmentation given the image, but we do not actually treat it probabilistically. 
Our goal is to dehne a model such that the true segmentation corresponding to a given image can be 
found by minimizing the cost; the energy can reflect both a prior over object configurations alone, as 
well as compatibility between object configurations and the image. 

As shown in Fig. 1, we define the global energy /) as the sum over local energy models (defined 
by a deep neural network) Es{x;S; I) at several different scales s computed in sliding-window 
fashion centered at every position x within the volume: 

S X 

Es{x;S;I) := Es (r,(x; 5); (/)(x;/)). 

The local energy Es{x;S; I) depends on the local image context centered at position x by way 
of a vector representation (t){x] I) computed by a deep convolutional neural network, and on the 
local shape/object configuration at scale s by way of a novel local binary shape descriptor rs{x; S), 
defined in Section 3. 

To find (locally) minimal-cost segmentations under this model, we use local search over the space of 
agglomerations starting from some initial supervoxel segmentation. Using a simple greedy policy, 
at each step we consider all possible agglomeration actions, i.e. merges between any two adjacent 
segments, and pick the action that results in the lowest energy. 

Naively, computing the energy for just a single segmentation requires computing shape descriptors 
and then evaluating the energy model at every voxel position with the volume; a small volume may 
have tens or hundreds of millions of voxels. At each stage of the agglomeration, there may be 
thousands, or tens of thousands, of potential next agglomeration steps, each of which results in a 
unique segmentation. In order to choose the best next step, we must know the energy of ah of these 
potential next segmentations. The computational cost to perform these computations directly would 
be tremendous, but in the supplement, we prove a collection of theorems that allow for an efficient 
implementation that computes these energy terms incrementally. 


3 Representing 3-D Shape Configurations with Local Binary Descriptors 


We propose a binary shape descriptor based on subsampled pairwise connectivity information: given 
a specification s of k pairs of position offsets {ai, 6i},..., {a/c, hk} relative to the center of some 
fixed-size bounding box of size Bg, the corresponding k-hit binary shape descriptor r{U) for a 
particular segmentation U of that bounding box is defined by 


r 



if Oi is connected to bi in U; 
otherwise. 


for i G [1, k]. 


As shown in Fig. 2a, each bit of the descriptor specifies whether a particular pair of positions are 
part of the same segment, which can be determined in constant time by the use of a suitable data 
structure. In the limit case, if we use the list of ah pairs of positions within an n-voxel bounding 
box, no information is lost and the Hamming distance between two descriptors is precisely equal to 
the Rand index. [25] In general we can sample a subset of only k pairs out of the ( 2 ) possible; if we 
sample uniformly at random, we retain the property that the expected Hamming distance between 
two descriptors is equal to the Rand index. We found that picking k = bl2 bits provides a reasonable 
trade-off between fidelity and representation size. While the pairs may be randomly sampled initially, 
naturally to obtain consistent results when learning models based on these descriptors we must use 
the same fixed list of positions for defining the descriptor at both training and test time. ^ 


Note that this descriptor serves in general as a type of sketch of a full segmentation of a given 
bounding box. By restricting one of the two positions of each pair to be the center position of the 
bounding box, we instead obtain a sketch of just the single segment containing the center position. 
We refer to the descriptor in this case as center-based, and to the general case as pairwise, as shown 
in Fig. 2b. We will use these shape descriptors to represent only local sub-regions of a segmentation. 
To represent shape information throughout a large volume, we compute shape descriptors densely at 
all positions in a sliding window fashion, as shown in Fig. 2c. 


^The BRIEF descriptor [5] is similarly defined as a binary descriptor based on a subset of the pairs of points 
within a patch, but each bit is based on the intensity difference, rather than connectivity, between each pair. 
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(b) Shape descriptors are computed at multiple scales. Pairwise descriptors (shown left and center) consider 
arbitrary pairwise connectivity, while center-based shape descriptors (shown right) restrict one position of each 
pair to be the center point. 



(c) Shape descriptors are computed densely at every position within the volume. 

Figure 2: Illustration of shape descriptors. The connected components of the bounding box U for 
which the descriptor is computed are shown in distinct colors. The pairwise connectivity relationships 
that define the descriptor are indicated by dashed lines; connected pairs are shown in white, while 
disconnected pairs are shown in black. Connectivity is determined based on the connected components 
of the underlying segmentation, not the geometry of the line itself. While this illustration is 2-D, in 
our experiments shape descriptors are computed fully in 3-D. 


Connectivity Regions 


As defined, a single shape descriptor represents the segmentation within its fixed-size bounding box; 
by shifting the position of the bounding box we can obtain descriptors corresponding to different 
local regions of some larger segmentation. The size of the bounding box determines the scale of the 
local representation. This raises the question of how connectivity should be defined within these local 
regions. Two voxels may be connected only by a long path well outside the descriptor bounding box. 
As we would like the shape descriptors to be consistent with the local topology, such pairs should 
be considered disconnected. Shape descriptors are, therefore, defined with respect to connectivity 
within some larger connectivity region, which necessarily contains one or more descriptor bounding 
boxes but may in general be significantly smaller than the full segmentation; conceptually, the shape 
descriptor bounding box slides around to all possible positions contained within the connectivity 
region. (This sliding necessarily results in some minor inconsistency in context between different 
positions, but reduces computational and memory costs.) To obtain shape descriptors at all positions, 
we simply tile the space with overlapping rectangular connectivity regions of appropriate uniform 
size and stride, as shown in the supplement. The connectivity region size determines the degree 
of locality of the connectivity information captured by the shape descriptor (independent of the 
descriptor bounding box size). It also affects computational costs, as described in the supplement. 
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4 Energy model learning 

We define the local energy model Eg (r; v) for each shape descriptor type/scale 5 by a learned neural 
network model that computes a real-valued score in [0,1] from a shape descriptor r and image feature 
vector V. 

To simplify the presentation, we define the following notation for the forward discrete derivative of / 
with respect to S\ A^f{S) := f{S -f e) — f{S). 

Based on this notation, we have the discrete derivative of the energy function A^E{S; I) = 
E{S E e; I) — E{S; I), where S E e denotes the result of merging the two supervoxels corre¬ 
sponding to e in the existing segmentation S. To agglomerate, our greedy policy simply chooses at 
step t the action e that minimizes /), where denotes the current segmentation at step t. 

As in prior work [24], we treat this as a classification problem, with the goal of matching the sign of 
I) to A|terror(5'^, S'*), the corresponding change in segmentation error with respect to a 
ground truth segmentation S*, measured using Variation of Information [23]. 


4.1 Local training procedure 


Because the A|t£’(S^; I) term is simply the sum of the change in energies from each position 

and descriptor type s, as a heuristic we optimize the parameters of the energy model Eg {r;v) 
independently for each shape descriptor type/scale s. We seek to minimize the expectation 


E. 


i{Al^, error(Si, S*), Es {rs{xi] Si + e); (/)(xi; /))) + 
i{-A^J, error(Si, S*), Es {rs{x; Si); (l){xi; /))) , 


where i indexes over training examples that correspond to a particular sampled position Xi and a 
merge action applied to a segmentation Si. i{y^ a) denotes a binary classification loss function, 
where a G [0,1] is the predicted probability that the true label y is positive, weighted by |^|. Note that 
if Ay_ error(S'^, S'*) <0, then action e improved the score and therefore we want a low predicted 
score for the post-merge descriptor rs{xi; Si + e) and a high predicted score for the pre-merge 
descriptor rs{xi; Si); if A^* error(S^, S*) > 0 the opposite applies. We tested the standard log loss 
i{y^ a) := \y\ • [ly>o log(<^) + ly<o log(l — cl)]^ as well as the signed linear loss i{y^ a) '•= y - a, 
which more closely matches how the Es{x; Si; I) terms contribute to the overall AgE{S; I) scores. 
Stochastic gradient descent (SGD) is used to perform the optimization. 

We obtain training examples by agglomerating using the expert policy that greedily optimizes 
error(S'^, S'*). At each segmentation state S^ during an agglomeration step (including the initial state), 
for each possible agglomeration action e, and each position x within the volume, we compute the shape 
descriptor pair {x; S^) and {x; S^Ee) reflecting the pre-merge and post-merge states, respectively. 
If rs{x;S^) 7 ^ rs{x; S^ E e), we emit a training example corresponding to this descriptor pair. We 
thereby obtain a conceptual stream of examples (e, A|t error(S^, S*), /), rs(x; S^), rs(x; S^ + 

e)). 

This stream of examples may contain billions of examples (and many highly correlated), far more 
than required to learn the parameters of Eg. To reduce resource requirements, we use priority 
sampling [14], based on | A| error (S', S*) |, to obtain a fixed number of weighted samples without 
replacement for each descriptor type s. We equalize the total weight of true merge examples 
(A| error (S, S*) < 0) and false merge examples (A| error (S, S*) >0) in order to avoid learning 
degenerate models.^ 


5 Experiments 

We tested our approach on a large, publicly available electron microscopy dataset, called Janelia FIB- 
25, of a portion of the Drosophila melangaster optic lobe. The dataset was collected at 8 x 8 x 8 nm 

^For example, if most of the weight is on false merge examples, as would often oecur without balaneing, the 
model can simply learn to assign a score that increases with the number of 1 bits in the shape descriptor. 
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VI 

Rand Fi 

— CELIS (this paper) 

1.672 

0.691 

— 3d-CNN-FGALA 

2.069 

0.597 

— 3d-CNN-F Watershed 

2.143 

0.629 

O Tcolsegl 

2.981 

0.099 

— Oracle 

0.428 

0.901 


Figure 3: Segmentation accuracy on 11-gigavoxel FIB-25 test set. Left: Pareto frontiers of 
information-theoretic split/merge error, as used previously to evaluate segmentation accuracy. [24] 
Right: Comparison of Variation of Information (lower is better) and Rand Fi score (higher is better). 
For CELIS, 3d-CNN+GALA, and 3d-CNN+watershed, the hyperparameters were optimized for each 
metric on the training set. 


resolution using Focused Ion Beam Scanning Electron Microscopy (FIB-SEM); a labor-intensive 
semi-automated approach was used to segment all of the larger neuronal processes within a « 20,000 
cubic micron volume (comprising about 25 billion voxels). [29] To our knowledge, this challenging 
dataset is the largest publicly available electron microscopy dataset of neuropil with a corresponding 
“ground truth” segmentation. 

For our experiments, we split the dataset into separate training and testing portions along the z axis: 
the training portion comprises z-sections 2005-5005, and the testing portion comprises z-sections 
5005-8000 (about 11 billion voxels). 

5.1 Boundary classification and oversegmentation 

To obtain image features and an oversegmentation to use as input for agglomeration, we trained 
convolutional neural networks to predict, based ona35x35x9 voxel image context region, whether 
the center voxel is part of the same neurite as the adjacent voxel in each of the x, y, and z directions, as 
in prior work. [31] We optimized the parameters of the network using stochastic gradient descent with 
log loss. We trained several different networks, varying as hyperparameters the amount of dilation of 
boundaries in the training data (in order to increase extracellular space) from 0 to 8 voxels and whether 
components smaller than 10000 voxels were excluded. See the supplementary information for a 
description of the network architecture. Using these connection affinities, we applied a watershed 
algorithm [35, 36] to obtain an (approximate) oversegmentation. We used parameters Ti = 0.95, 
Th = 0.95, Tg = 0.5, and Tg = 1000 voxels. 


5.2 Energy model architecture 

We used five types of 512-dimensional shape descriptors: three pairwise descriptor types with 9^, 
17^, and 33^ bounding boxes, and two center-based descriptor types with 17^ and 33^ bounding 
boxes, respectively. The connectivity positions within the bounding boxes for each descriptor type 
were sampled uniformly at random. 

We used the 512-dimensional fully-connected penultimate layer output of the low-level classification 
convolutional neural network as the image feature vector 0(x; I). For each shape descriptor type s, 
we used the following architecture for the local energy model Eg {r;v): we concatenated the shape 
descriptor vector and the image feature vector to obtain a 1024-dimensional input vector. We used 
two 2048-dimensional fully-connected rectified linear hidden layers, followed by a logistic output 
unit, and applied dropout (with p = 0.5) after the last hidden layer. While this effectively computes a 
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score from a raw image patch and a shape descriptor, by segregating expensive convolutional image 
processing that does not depend on the shape descriptor, this architecture allows us to benefit from 
pre-training and precomputation of the intermediate image feature vector /) for each position x. 
Training for both the energy models and the boundary classifier was performed using asynchronous 
SGD using a distributed architecture. [11] 


5.3 Evaluation 

We compared our method to the state-of-the-art agglomeration method GALA [24], which trains 
a random forest classifier to predict merge decisions using image features derived from boundary 
probabilities. ^ To obtain such probabilities from our low-level convolutional neural network classifier, 
which predicts edge affinities between adjacent voxels rather than per-voxel predictions, we compute 
for each voxel the minimum connection probability to any voxel in its 6-connectivity neighborhood, 
and treat this as the probability/score of it being cell interior. 

For comparison, we also evaluated a watershed procedure applied to the CNN affinity graph output, 
under varying parameter choices, to measure the accuracy of the deep CNN boundary classification 
without the use of an agglomeration procedure. Finally, we evaluated the accuracy of the publicly 
released automated segmentation of FIB-25 (referred to as Vcolsegl) [15] that was the basis of 
the proofreading process used to obtain the ground truth; it was produced by applying watershed 
segmentation and a variant of GALA agglomeration to the predictions made by an Ilastik [27] -trained 
voxel classifier. 

We tested both GALA and CELIS using the same initial oversegmentations for the training and test 
regions. To compare the accuracy of the reconstructions, we computed two measures of segmentation 
consistency relative to the ground truth: Variation of Information [23] and Rand Fi score, defined as 
the Fi classification score over connectivity between all voxel pairs within the volumes; these are the 
primary metrics used in prior work. [30, 8, 24] The former has the advantage of weighing segments 
linearly in their size rather than quadratically. 

Because any agglomeration method is ultimately limited by the quality of the initial oversegmentation, 
we also computed the accuracy of an oracle agglomeration policy that greedily optimizes the 
error metric directly. (Computing the true globally-optimal agglomeration under either metric is 
intractable.) This serves as an (approximate) upper bound that is useful for separating the error due to 
agglomeration from the error due to the initial over segmentation. 


6 Results 


Figure 3 shows the Pareto optimal trade-offs between test set split and merge error of each method 
obtained by varying the choice of hyperparameters and agglomeration thresholds, as well as the 
Variation of Information and Rand Fi scores obtained from the training set-optimal hyperparameters. 
CELIS consistently outperforms all other methods by a significant margin under both metrics. The 
large gap between the Oracle results and the best automated reconstruction indicates, however, that 
there is still large room for improvement in agglomeration. 

While the evaluations are done on a single dataset, it is a single very large dataset; to verify that 
the improvement due to CELIS is broad and general (rather than localized to a very specific part of 
the image volume), we also evaluated accuracy independently on 18 non-overlapping 500^-voxel 
subvolumes evenly spaced within the test region. On all subvolumes CELIS outperformed the best 
existing method under both metrics, with a median reduction in Variation of Information error of 19% 
and in Rand Fi error of 22%. This suggests that CELIS is improving accuracy in many parts of the 
volume that span significant variations in shape and image characteristics. 


"^GALA also supports multi-channel image features, potentially representing predicted probabilities of 
additional classes, such as mitochondria, but we did not make use of this functionality as we did not have training 
data for additional classes. 
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7 Discussion 


We have introduced CELIS, a framework for modeling image segmentations using a learned energy 
function that specihcally exploits the combinatorial nature of dense segmentation. We have described 
how this approach can be used to model the conditional energy of a segmentation given an image, and 
how the resulting model can be used to guide supervoxel agglomeration decisions. In our experiments 
on a challenging 3d microscopy reconstruction problem, CELIS improved volumetric reconstruction 
accuracy by 20% over the best existing method, and offered a strictly better trade-off between split 
and merge errors, by a wide margin, compared to existing methods. 

The experimental results are unique in the scale of the evaluations: the 11-gigavoxel test region is 2-4 
orders of magnitude larger than used for evaluation in prior work, and we believe this large scale of 
evaluation to be critically important; we have found evaluations on smaller volumes, containing only 
short neurite fragments, to be unreliable at predicting accuracy on larger volumes (where propagation 
of merge errors is a major challenge). While more computationally expensive than many prior 
methods, CELIS is nonetheless practical: we have successfully run CELIS on volumes approaching 
^ 1 teravoxel in a matter of hours, albeit using many thousands of CPU cores. 

In addition to advancing the state of the art in learning-based image segmentation, this work also has 
significant implications for the application area we have studied, connectomic reconstruction. The 
LIB-25 dataset refiects state-of-the-art techniques in sample preparation and imaging for large-scale 
neuron reconstruction, and in particular is highly representative of much larger datasets actively 
being collected (e.g. of a full adult fly brain). We expect, therefore, that the significant improvements 
in automated reconstruction accuracy made by CELIS on this dataset will directly translate to a 
corresponding decrease in human proof-reading effort required to reconstruct a given volume of tissue, 
and a corresponding increase in the total size of neural circuit that may reasonably be reconstructed. 

Euture work in several specific areas seems particularly fruitful: 

• End-to-end training of the CELIS energy modeling pipeline, including the CNN model 
for computing the image feature representation and the aggregation of local energies at 
each position and scale. Because the existing pipeline is fully differentiable, it is directly 
amenable to end-to-end training. 

• Integration of the CELIS energy model with discriminative training of a neural network- 
based agglomeration policy. Such a policy could depend on the distribution of local energy 
changes, rather than just the sum, as well as other per-object and per-action features proposed 
in prior work. [24, 3] 

• Use of a CELIS energy model for fixing undersegmentation errors. While the energy 
minimization procedure proposed in this paper is based on a greedy local search limited to 
performing merges, the CELIS energy model is capable of evaluating arbitrary changes to 
the segmentation. Evaluation of candidate splits (based on a hierarchical initial segmentation 
or other heuristic criteria) would allow for the use of a potentially more robust simulated 
annealing energy minimization procedure capable of both splits and merges. 

Several recent works [26, 34, 7, 6] have integrated deep neural networks into pairwise-potential 
conditional random field models. Similar to CELIS, these approaches combine deep learning with 
structured prediction, but differ from CELIS in several key ways: 

• Through a restriction to models that can be factored into pairwise potentials, these ap¬ 
proaches are able to use mean field and pseudomarginal approximations to perform efficient 
approximate inference. The CELIS energy model, in contrast, sacrifices factorization for the 
richer combinatorial modeling provided by the proposed 3-D shape descriptors. 

• More generally, these prior CRE methods are focused on refining predictions (e.g. improving 
boundary localization/detail for semantic segmentation) made by a feed-forward neural 
network that are correct at a high level. In contrast, CELIS is designed to correct fundamental 
inaccuracy of the feed-forward convolutional neural network in critical cases of ambiguity, 
which is reflected in the much greater complexity of the structured model. 
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Table 1: GALA and CELIS results on 18 non-overlapping 500^-voxel subvolumes within FIB-25 test 
region. Each subvolume is identified by the (x, y, z) coordinates of its start comer. 


Subvolume 

GALA 

CELIS 

VI 

Rand Fi 

VI 

Rand Fi 

(3056, 2228, 5006) 

0.710920 

0.877058 

0.639525 

0.883413 

(3684, 2228, 5006) 

0.852962 

0.807559 

0.742159 

0.847611 

(2428, 2856, 5006) 

0.550208 

0.939376 

0.543177 

0.948212 

(3056, 2856, 5006) 

0.902916 

0.763665 

0.676395 

0.801584 

(3056, 2228, 5634) 

0.825079 

0.830070 

0.674668 

0.880949 

(3684, 2228, 5634) 

0.912993 

0.843953 

0.692192 

0.868810 

(2428, 2856, 5634) 

0.806402 

0.866520 

0.731852 

0.893787 

(3056, 2856, 5634) 

0.896207 

0.882145 

0.748106 

0.903290 

(2428, 2228, 6262) 

0.724371 

0.901122 

0.579692 

0.941264 

(3056, 2228, 6262) 

0.991092 

0.848851 

0.806581 

0.897247 

(3684, 2228, 6262) 

0.971468 

0.787515 

0.747754 

0.861740 

(2428, 2856, 6262) 

0.881123 

0.869841 

0.795054 

0.897715 

(3056, 2856, 6262) 

1.113600 

0.844898 

0.877817 

0.904204 

(2428, 2228, 6890) 

0.963757 

0.902604 

0.733394 

0.953988 

(3056, 2228, 6890) 

0.983061 

0.844851 

0.870729 

0.854900 

(3684, 2228, 6890) 

0.966499 

0.883959 

0.764519 

0.910047 

(2428, 2856, 6890) 

1.380077 

0.710782 

0.869191 

0.821951 

(3056, 2856, 6890) 

1.128732 

0.713933 

0.916494 

0.846875 


Table 2: Network architecture used for oversegmentation and image features. 


Layer 

Input 

Transform 

Output 

# parameters 

Dropout (p) 

1 

1 X 35 X 35 X 9 

5x5x1 convolution, ReLU 

64 X 31 X 31 X 9 

64 • (52 + 1) 

0.9 

2 

64 X 31 X 31 X 9 

5x5x5 convolution, ReLU 

64 X 27 X 27 X 5 

64 • (64 -53 + 1 ) 

0.9 

3 

64 X 27 X 27 X 5 

2x2x1 max pooling 

64 X 14 X 14 X 5 


0.9 

4 

64 X 14 X 14 X 5 

5x5x5 convolution, ReLU 

64 X 10 X 10 X 1 

64 • (64 -53 + 1 ) 

0.9 

5 

64 X 10 X 10 X 1 

2x2x1 max pooling 

64 X 5 X 5 X 1 


0.9 

6 

64 X 5 X 5 X 1 

Fully-connected ReLU 

512 

512 • (64 -52 + 1 ) 

0.5 

7 

512 

Fully-connected logistic 

3 

3- (512 + 1) 
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(a) Small-scale (b) Large-scale 


Figure 4: Connectivity region tiling. The connected components of the segmentation within each 
connectivity region C (shown in distinct colors) are maintained independently. The yellow rectangle 
within each connectivity region indicates the bounds of the set of (type s) shape descriptor center 
positions computed using C, which is simply the set of center positions for which the shape descriptor 
bounding box is contained within C. The white rectangle (of size Bs) indicates the bounding box of 
the shape descriptor (necessarily contained within C). 
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Figure 5: Examples of cases where local boundary classification alone leads to false splits of neurites. 
A cross-section of the raw data is shown on the left; the correct segmentation (determined by careful 
human annotators) of the central neurite is overlayed on the right. Neuronal processes often narrow 
to nearly the limit of the image resolution, and when this is coupled with a loss of contrast, it appears 
to be impossible to determine the correct segmentation from local boundary information alone. 
These examples are from a Drosophila larval neuropil dataset [24] imaged using Focused Ion Beam 
Scanning Electron Microscopy (FIBSEM) [24]. 
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Figure 6: Examples of cases where independent neurite shape modeling breaks down. At these 
synapse sites, the pre-synaptic and post-synaptic neurons each have characteristic shapes that are 
highly unlikely to occur independently but are jointly very likely. Due to the close contact between the 
two neurons, local boundary classification at these sites often results in false mergers, making correct 
shape modeling particularly critical. A cross-section of the raw data is shown on the left; the correct 
segmentation (determined by careful human annotators) is overlayed on the right. These examples are 
from a Drosophila larval neuropil dataset [24] imaged using Focused Ion Beam Scanning Electron 
Microscopy (FIBSEM) [24]. 



Figure 7: Distinction between local and global connectivity. In the cross-section of raw data on the 
left, there is clear evidence that the two points indicated within the yellow bounding box are separated 
by cell membrane. From the manual annotation overlaid on the right, it is clear, however, that they 
are nonetheless part of the same cell, highlighted in red. Thus, within a sufficiently local area the two 
points are disconnected, but globally they are connected. Distinguishing the connectivity of points at 
multiple scales is critical for accurate shape modeling. If connectivity is represented only globally, as 
in prior agglomeration work [2, 24], it may be impossible to reconcile strong local evidence of a cell 
boundary between two parts of the same sell in cases of self-contact, leading to poor learning and 
incorrect predictions for these cases. This example is from a Drosophila larval neuropil dataset [24] 
imaged using Focused Ion Beam Scanning Electron Microscopy (FIBSEM) [24]. 
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A Local Connectivity 


To reliably distinguish between local and global connectivity, we represent segmentations globally as 
an undirected graph over voxels. The vertices of this graph correspond to positions in and edges 
are typically limited to occur between neighboring voxel positions, for some definition of neighboring. 
We will define our neighborhood J\f{x) to be the von Neumann neighborhood (6-connectivity), though 
the Moore neighborhood or any other (symmetric) neighborhood could equally well be used. 

The segments themselves are implicitly defined by the connected components of this graph, in contrast 
to a representation defined by an explicit labeling of voxels by the component to which they belong. 
The advantage of this representation is illustrated in Fig. 8. 

B Local shape descriptors 

Each connectivity region C is a rectangular subset of the full volume. 

Definition 1. Given a shape descriptor specification s and connectivity region C, we denote by 
the set of (type s) shape descriptor center positions for which the descriptor bounding box is 
contained within C. 

Remark. Note that is a rectangular region obtained by simply shrinking the rectangular region 
C by {Bs — l)/2 on all sides (recall that Bg is the shape descriptor bounding box for type s shape 
descriptors). 

We wish to represent shape information at multiple scales, and to represent both the joint shape of 
nearby objects as well as the shape of individual objects. Therefore, rather than using a single shape 
descriptor specification s and a single connectivity region tiling, we use a set of shape descriptor 
specifications s, each implicitly associated with a particular choice of connectivity region size Bg 
and stride strides (specified by 3-D vectors of integers) that define a overlapped tiling of the full 
segmentation space. 

Definition 2. We define Cg to be the set of connectivity regions obtained as regular overlapping tiles 
of size Bg and stride strides. 

To ensure that the bounding box for a shape descriptor at a given position is contained in exactly one 
connectivity region, we constrain Bg and strides as follows: 

Bg < Bg] 

Bg = Bg — strides + 1. 

These constraints ensure that Cg exactly partitions the set of shape descriptor center positions, which 
allows us to make the following definition: 

Definition 3. We denote by Cg{x) the single C e Cg such that x G X^. 

For convenience, we will also introduce some notation that applies to general undirected graphs that 
is relevant to our discussion: 

Definition 4 (Connected components). Given an undirected graph G, we denote by /C(G) the 
partition of the vertex set of G into connected components, and denote hy K{v] G) the connected 
component G containing the vertex v. 

Definition 5 (Induced subgraph). Given an undirected graph G and a subset V' of its vertices, we 
denote by G\y'] the subgraph of G induced by V'. 

While globally we will represent a segmentation S' as a voxel graph, within a given connectivity region 
G we are concerned only with the connected components /C(S[G]) in the subgraph of S induced by 
G. Note that because the vertices of S correspond to voxels, i.e. positions in I?, JC{S[G]) C 2^ . 
Based on these definition, we can more precisely state how local shape descriptors are defined. 

Definition 6. Given a full segmentation S, for each shape descriptor specification s, we define the 
|s|-bit local binary shape descriptor rg{x] S) at position x by 

{x; S) := lKix+a-,S[C])=Kix+b-,S[C]) for {a, b} e s, 

where G = Cg{x). 
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(b) Component representation 


(c) Component representation 



(a) Graph representation 


Figure 8: Advantage of voxel graph representation. The top row shows a representation of a 
segmentation as either a voxel graph or a component labeling. The bottom row shows the effect 
of restricting the segmentation to a sub-region. Each square corresponds to a voxel. In the graph 
representation, a white line between two voxels indicates an edge, while a black line indicates the 
lack of an edge. In the component representation, each voxel is labeled by a component identifier (0 
or 1). The different colors (red, blue, and grey) correspond to different connected components. The 
graph representation, shown on the left, correctly disconnects the two parts when restricted to the 
sub-region. The component labeling representation, shown in the middle, is unable to represent the 
presence of a boundary between the two parts, and therefore incorrectly results in a single connected 
component even when restricted to the sub-region. It is possible to emulate a voxel graph using a 
component representation by indicating boundaries with a 1-voxel wide background component, as 
shown on the right, but this tends to be cumbersome. 


Definition 7. Given a segmentation S, we define the component visibility set Vs{x; S) C IC{S[C]) 
of a position x to be the set of connected components at positions sampled by the shape descriptor s: 

Vs{x; S) := {K{x -h c; 5'[C]) | c G {a, 6} G s}, 


where C = Cs{x). 

Lemma 1. Let a shape descriptor specification s, a position x, and segmentations S and S' be given. 
Let C = Cs{x). IfVs{x;S) = Vs{x;S') (in particular if JC{S[C]) = 1C{S'[C])), then rs{x; S) = 
rs{x; S'). Furthermore, in the case that s is center-based, then ifK{x; 5'[C]) = K{x; S'[C]), then 
rslx;S) = rs{x;S'). 


Proof. The first statement follows directly from Definition 6. 

To prove the second statement, suppose that s is center-based. Note that for all {a, 6} G s, {a, 6} = 
{0, c} for c G {a, b}. Thus, we have 

= ^K(x-,S[C])=K(x+c\S[C]) 

= l(a:+c)6if(a:;S[C]) for {0, c} = {a, b} G S. 

The result follows. □ 

Remark. For general shape descriptor specifications s, rs{x; S) depends on S only by way of the 
subset of JC{S[Cs(^x)]) that are sampled, and for center-based shape descriptor specifications, rs{x; S) 
depends on S only by way of K{x; S[Cs{x)]), the single component in S[C] that contains x. 
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C Efficient energy minimization 


Naively, computing the energy for just a single segmentation requires computing shape descriptors 
and then evaluating the energy model at every voxel position with the volume; a small volume may 
have tens or hundreds of millions of voxels. At each stage of the agglomeration, there may be 
thousands, or tens of thousands, of potential next agglomeration steps, each of which results in a 
unique segmentation. In order to choose the best next step, we must know the energy of all of these 
potential next segmentations. The computational cost to perform these computations directly would 
be tremendous. 

We will discuss several computational tricks that allow us to efficiently compute these energy terms 
incrementally. Because the cost of evaluating the local energy model for a single shape descriptor is 
many times more expensive than computing the shape descriptor, we structure our computation such 
that we only recompute a local energy term if the shape descriptor on which it depends has changed. 
This ensures that the total cost of evaluating the local energy terms is minimized, but even computing 
just the shape descriptors at each position within the volume for each potential agglomeration action 
at each step would still be prohibitively expensive. We therefore rely on geometric and region graph 
information to prune out the vast majority of this computation as well. Collectively, these tricks 
reduce the computational cost by several orders of magnitude; the effectiveness of these techniques is 
ultimately data-dependent, however. 

C.l Action representation 

Recall that each agglomeration action e corresponds to a set of additional voxel edges to be added to 
the current segmentation state S^. While in principle agglomeration could be defined with respect 
to arbitrary sets of voxel edges, we will carefully choose the set of actions to be considered in 
order to preserve the distinction between local and global connectivity while also allowing for a 
computationally-efficient implementation. 

We will define actions in terms of adjacent supervoxels AT, K' G 1C{S^) in the initial segmentation: 
Definition 8. For any two distinct connected components AT, AT' G /C(5'^), let 

eK,K' •= {{x, x'} I x' G M{x) A (x, x') e K X AT'}. 

Remark. If K and K' are not adjacent, then eK,K' = 0- 

Note that we represent edges in the undirected voxel graph simply as two-element sets of voxel 
positions. 

Definition 9. We define the supervoxel merge action set 

:= {eK,K' 7^ 0 I K, K' G /C(5') A K K'}. 

We will use := Asq as our set of actions for agglomeration. Note that each action corresponds 
to a set of voxel graph edges. At each step t of agglomeration, we choose an action G The 
set of remaining actions A^ after step t is simply the subset of actions in A that have not yet been 
performed, i.e. = A^ — {e^}. The segmentation state 5'^+^ := 5'^ + e^. 

Definition 10. If e is a set of edges and C is a set of vertices, we denote by e[C] the restriction of e 
to vertices in C, i.e. the subset of edges in e that are incident to two vertices in C. If S' is a graph, we 
define e[S] := e[vertices(S)] to be the restriction of e to vertices in S. 

Given a graph S and a partition T of vertices(S), we denote by G/T the contraction of G by T. 

Definition 11. A set e of voxel edges is said to be a supervoxel merge in a voxel graph S of 
components AT, K' G 1C{S) if e[S] is a non-empty set of edges between components K and AT', or 
equivalently, that every edge in e[S] corresponds to the edge {AT, AT'} in S//C(S). 

Definition 12. A set e of voxel edges is said to be a redundant merge in a voxel graph S, corre¬ 
sponding to the component K G /C(S), if e[S] is a non-empty set of edges within component AT, i.e. 
{{K{a; S),K{b; S)} | {a, 6} G e} = {{Ff}}, or equivalently, that every edge in e corresponds to a 
self edge {AT} in S/1C{S). 

Lemma 2. If e is a redundant merge in S, then /C(5' + e) = IC{S). 
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Proof. This follows from the fact that adding an edge between two vertices already part of the same 
connected component does not change set of connected components. □ 


Definition 13. Let e, e' be supervoxel merges in S. We say that e is incident to a connected 
component K G 1C{S) in S if every edge in e is incident to a voxel in K, i.e. e is incident to K in 
S/]C{S). We say that e is incident to e' in S if there exists K G componentsS to which both e and 
e' are incident, i.e. e is incident to e' in S/IC{S). 

Lemma 3. If e is a supervoxel merge in S and S is a spanning subgraph of S', then e is a supervoxel 
merge or redundant merge in S'. If e is a redundant merge in S, then e is a redundant merge in S'. 


Proof. Suppose e is a supervoxel merge in S, corresponding to iT, K' G ]C{S). There must exist a 
components J, J' G 1C (S') with K C J and K' C J' . If J = J' , then e is a redundant merge in S'; 
otherwise e is a supervoxel merge of { J, J'}. 

Suppose e is a redundant merge in S corresponding to K G IC{S). There must exist a component 
J G /C(5") with K ^ J. Hence, e is a redundant merge in S' corresponding to J. □ 

Remark. S is necessarily a spanning subgraph of S' + e for any merge action e. 

Lemma 4. At all steps t, all e ^ A are either supervoxel merges or redundant merges in S^. 


Proof. This follows from Lemma 3 and the fact that all e G A are supervoxel merges in S^. □ 


The consequence of this lemma is that globally each merge action corresponds to a pair of connected 
components. Within the induced subgraph S'^IC] of restricted to a given connectivity region C, 
however, this lemma does not necessarily hold, even for S^[C], because a connected component of 
S^ may correspond to more than one connected component of ^^[(7]. For computational reasons that 
will be made apparent in Appendix C.2, we would like to ensure that it does hold, so that each merge 
action also corresponds to a pair of connected components within each connectivity region C (or is 
redundant within C). 

To do this, we will assume that each connected component of S^ is a clique. Our assumption sacrifices 
any distinction between local and global connectivity within the original supervoxels of 5'^, but this 
is a small sacrifice given that they are expected to be small. 

Lemma 5. Given a connectivity region C, ife is a supervoxel merge in S^ and e[C] is non-empty, 
then e is either a supervoxel merge or a redundant merge in S^ [C] for all t. 


Proof. Suppose e is a supervoxel merge in S^ of components iTi, 7^2 G JC{S^). For all {a, b} G e[C], 
without loss of generality we can assume a G Ffi and b E K 2 . By our assumption that Ki and K 2 
are cliques in S^, KinC,K 2 r[C G IC{S^[C]). By the definition of e[C], we have a e KiCiC and 
b e K 2 C] C. Hence, e is a supervoxel merge in 5'^[C']. The result follows from Lemma 3. □ 

C.2 A representation 

Recall that we defined the forward discrete derivative of / with respect to S' by: 

A|/(5) ;= f{S + e)-f{S). 

We also define the second discrete derivative: 

AffiS) := A^A^7(5). 

To efficiently implement a local search over agglomerations, at each step t of agglomeration, for 
each possible next agglomeration action e, we maintain the discrete derivative AgtE{S'^; /), where 
denotes the current segmentation at step t. Although our energy model is defined without any 
reference to supervoxels or merges, we prove a number of key properties that enable us to very 
efficiently compute and update these discrete derivative terms. 
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To maintain AgtE{S^; /), conceptually we must initially compute AgoEs{x; I) for each position 
X and action e, and then at each subsequent step t, agglomeration action is taken and we update 

= A%,E{S^;I) 

I) for all e G 


Theorem 1 (Descriptor-based pruning). Let a position x and image I be given. Let f{S') := Vs (x; S') 
and_E{S') := Es{x; S'; I). Given a segmentation S, and merge e, if f{S) = f{S + e), then 
AgE{S) = 0. Furthermore, for any merge e', 

d:= A^/e{S) = 

FE{S) -E{SEe) 

—E(^S + e') -\-E(^S + + e), 


where some or all of the 4 terms can be canceled based on whether r{S) = r{SEe), r{S) = r{SEe'), 
f{S + e) = f{S + e' + e), and/or f{S + e') = f{S + e' + e). In particular, 

f{S) = f{S + e) A f{S + e') = f{S + e' + e) 

^ d = 0; 

f{S) ^ f{S + e) A f{S + e') = f{S + e' + e) 

^ d = E{S)-E{SEe); 

f{S) = f{S + e) A f{S -h e') f{S + e' + e) 
d = E{SEe' Fe)- E{S E e'). 

By symmetry of the theorem with respect to e and c! we also have: 

f{S) = f{S + e') A f{S E e) = f{S + e' + e) 
d = 0; 

f{S) = f{S + e') A f{S Ee)^ f{S + e' + e) 
d = E{SEe' Fe)- E{S + e); 

f{S) ^ f{S + e') A f{S E e) = f{S + e' + e) 
d = E{S)-E{SEe'). 


Proof For the first statement, ifr(5') = r(5' + e),we have 

E{S)=Es {f{S);f{x;I)) 

= Es {f{SEe);f{x;I)) 
= E{SEe). 


The result follows. 

The second statement is a straightforward result of the same cancellation principle. □ 

Remark. This theorem allows us to skip a large fraction of evaluations of the local energy model, 
which is in general significantly more expensive than just computing the shape descriptors (which 
must still be done in order to check the conditions of this theorem). If a packed bitvector representation 
is used, the cost of the descriptor comparisons is negligible. 

C.3 Connectivity region-based pruning 

Recall that for every merge action e exactly one of the following is true: 

1 . e is a supervoxel merge in 5'^ [C ]; 
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2. e is a redundant merge in S^[C]; 

3. e[C] = 0. 

Definition 14. For each connectivity region C, we define the active action set A^[C] C to be the 
subset of actions at step t that are supervoxel merges in S^[C]. 

Lemma 6. Given a connectivity region C, if e ^ A^[C], then e ^ A^ [C] for all t' > t 

Proof Suppose e ^ A^[C]. Then either e[C] = 0 or e is a redundant merge in S^[C]. If e[C] = 0, 
then e ^ A^ [C] for any t'. Alternatively, if e is a redundant merge in S^[C], then since S^[C] is a 
spanning subgraph of [C], by Lemma 3 e is a redundant merge in [C]. □ 

Theorem 2 (Connectivity region-based pruning). Given a position x, time step t, and merge e G A^, 
let C = Cs{x) and let A' = A^[C]. If e 0 A', then AgtEs{x; I) = 0. Furthermore, if 

{e, e'} 2 ihen A^f Es{x; S^;I) = 0. 

Proof We will begin by proving the first statement. Suppose e ^ A'. By definition of A', it follows 
that e is a redundant edge in S^[C], i.e. IC{S^[C]) = IC{{S^ + ^)[E])- By Lemma 1, we have 
rs{x; S^) = rs{x; 5'^ + e) = r. The result follows from the first part of Theorem 1. 

Next we will consider the second statement. Since A^f Es{x; S^;I) = A^^fEs{x;S^;I), the 
second statement is symmetric with respect to e and e' . It is sufficient, therefore to again consider 
the case that e ^ A'. By the first statement, AgtEs{x; S^;I) = 0. Since is a spanning subgraph 
of F e', it is likewise the case that e is a redundant merge in {S^ + e')[C], which implies that 
rs{x; + e') = rs{x; 5'^ + e' + e). The result follows from the second part of Theorem 1. □ 

Remark. Because each action is typically active in only a tiny fraction of the connectivity regions, 
this theorem allows us to dramatically limit our computation. 

C.4 Graph-based pruning 

Lemma 7. Let a segmentation S and a supervoxel merge e in S be given. Let K G IC{S) be a 
connected component of S. If e is not incident in S to K, then K G /C(5' + e), i.e. merging e in S 
does not affect K. 

Proof. This follows from the fact that by definition of incidence of a supervoxel merge, no edge in e 
is incident to any voxel in K. □ 

Theorem 3 (Graph-based pruning). Suppose s defines a center-based descriptor. Let a segmentation 
S, position x, and supervoxel merges e and c! in S be given. Let C = Cs{x). If e is not incident 
in S[C] to K{x; S[C]), then rs{x;S) = rs{x;S + e) and AgEs{x; S; I) = 0. Furthermore, if 
e is not incident in {S + E)[C] to K{x; {S + e')[C]), or c! is not incident to e in S[C], then 

AfEs{x;S;I) = 0. 

Proof. We will being by proving the first statement. Suppose e is not incident in S[C] to K := 
K{x; 5'[C]). By Lemma 7, we have K = K{x;{S e)[C]). By Lemma 1 this implies that 
rs{x; S) = rs{x; S e). The result follows from the first part of Theorem 1. 

Next we will consider the second statement. Note that the condition that e is incident in {S + e') [C] 
to K{x; {S + e') [C]) is equivalent to the condition that e is incident in S[C] to K := K{x; 5'[C]), or 
e' is a supervoxel merge of K and K' in S[C] (i.e. incident to e in 5'[C]) and e is incident to K' in 

There are two cases to consider: suppose e is not incident in {S + e')[C] to K{x; {S + e') [C]). Then 
since S' is a spanning subgraph of S + e', it follows that e is also not incident in S[C] to K{x; ^[(7]). 
The result follows from applying the first statement of the theorem to both S and S F e' and then 
using Theorem 1 . 

Alternatively, suppose e' is not incident to e in S[C]. This implies that 7f(x; ^[(7]) is incident to at 
most one of {e, e'} in S[C]. By the symmetry of the theorem with respect to e and e', we will assume 
without loss of generality that e is not incident to K{x; S[(7]) in S[C]. By our note above, we can 
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infer that the condition for our first case, that e is not incident to K{x] {S + e')[C]) in {S + e')[C], 
holds. □ 


Remark. This theorem demonstrates that for center-based descriptors, we can significantly limit 
computation based on the agglomeration graph structure. The cost of maintaining the incidence 
information is negligible. 


C.5 Visibility-based pruning 

Lemma 8. Let a position x, segmentation S and supervoxel merge e of components Ki and K 2 in 
S[C], where C = Cs{x), he given. If e is incident in S[C] to at most one component in Vs{x; S), 
then rs{x; S) = rs{x; S + e). 


Proof. There are two cases to consider. If e is not incident in S[C] to any component in Vs {x; S), then 
by Lemma 7, Vs{x; S) = Vs{x; S' -j- e). The result follows from Lemma 1. If e is incident in S[C] to 
exactly one component Ki G Vs{x; S), then Vs{x; S + e') = Vs{x; S) + {K[ U K2} — {^1}, i.e. 
merging e' in S adds additional voxels (not part of any visible component) to one visible component. 
Since these additional voxels are, by definition, not sampled by the shape descriptor, it follows that 
rs{x;S) = rs{x;S + e). □ 

Theorem 4 (Visibility-based pruning). Given a position x, and segmentation S, let C = Cs{x) 
Let e! he a supervoxel merge of components Ki and K2 in S[C]. If e' is not incident in S[C] 

to any component Ki G Vs{x; S), then AgEs{x; S; I) = 0, and Es{x;S;I) = 0 for all 
supervoxel merges e in S[C]. If e' is incident in S[C] to exactly one component Ki G Vs{x; S), 
then for all supervoxel merges e of K[^K'2 in S[C] not incident to K2 in S[C], i.e. K2 ^ 

AfEs{x;S;I) = 0. 

Proof. To prove the first statement, suppose e' is not incident in S[C] to any component in Vs {x;S). 
For any supervoxel merge e in S[C], it must be the case that E is incident in (S -f e)[C] to at 
most one component in Vs{x; S E e). By applying Lemma 8 to both S[C] and S[C + e], we have 
rs{x; S) = rs{x; S -1- e') and rs{x; S e) = rs{x; S e e'). The result follows from Theorem 1. 

To prove the second statement, suppose E is incident in S[C] to exactly one component Ki G 
Vs{x; S). As for the first statement, by Lemma 8 we have rs{x; S) = rs{x; S E E). Let e be a 
supervoxel merge of ATj, incident to K2 in S[C]. If e is incident to ATi, then E is 

incident in {S + e)[C] to exactly one component {K[ E K 2 ) ^ Tfi. If e is not incident to Ki, then 
E is incident in (S + e)[C] to exactly the one component Ki. Therefore, by Lemma 8 we have 
rs{x; S E e) = rs{x; S E e E E) and the result follows from Theorem 1. □ 


Determining whether a given component AT G S'[C] is a member of the exact visibility set Vs{x] S) 
for all positions x G Xq is computationally expensive, i.e. 0(|X^| • |5|). However, to satisfy the 
conditions of Theorem 4, it is sufficient to check membership in any superset of the visibility set; 
this restricts the conditions under which pruning is done, but we can choose a superset in which 
membership can be checked much more efficiently. 

Definition 15. For d-dimensional vectors a, 6 G we denote by the hyperrectangle 

a| := {x G I a < x < b}. 


Definition 16. Given a segmentation S, we define the approximate component visibility set 
Vs{x; S) C IC{S[C]) of a position x to be the set of connected components at positions within 
a bounding box of size Bs centered at x: 


V.ix; S) := {Kix + c;S[C]) 


c€ R 


(Bs-l)/2 

-{Bs-l)/2 


}. 


where C = Cs{x). 

Lemma 9. Given a segmentation S, Vs{x; S) C Vs{x; S). 


Proof. This follows from the fact that {a, b} C -f)% ^ 


□ 
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Definition 17. For two coordinate vectors a and b,aQb denotes the element-wise product. 

For a given component K G S[C], by first computing a summed area table [10], we can efficiently 
determine whether K G Vs {x ; S[C]) for all positions x G X^, as described in Algorithm 1. The 
computational cost is 0(|C|). To check the conditions of Theorem 4 for a given supervoxel merge 
e' of Ki^K2 G S[C], we simply apply Algorithm 1 to both Ki and K2. Alternatively, to check 
only the (more limited) first condition that {Ki, K2} C] Vs{x; S) = 0 , then it is sufficient to apply 
Algorithm 1 just once to KiU K 2 . 


Algorithm 1 Optimized membership test for approximate component visibility sets. 

Require: (G, +) is a commutative group with identity Og- 
1: function ComputeSummedAreaTable(A: ^ G, Rl) 

2: Declare array T: G 

3: for X G : \\x — ajlg < d do 

4: ^ Og 

5: end for 

6: for X G j do > Iteration over x must respect the usual partial ordering on Z^. 

7: ^ “ 1) + E (_l)i+lklli . 

ze{o,iV-{6} 

8 : end for 

9: return T 

10: end function 

11: function SummedAreaTableLookup(T: ^ G, R^^, C R^J 

12: return ^ • T(6 + (a - 6) 0 2 :) 

ze{o,iV 

13: end function 

14: function ComputePositionsWithVisibility(s, S,C, K e 5'[G]) 

15: Define A{x) := \k=k(x-,S[C]) 

16: T ComputeSummedAreaTable(A, C) 

17: X ^ 0 

18: for X G do 

19: if SummedAreaTableLookup(T, ^ then 

X (-Dg 1 

20: X — X U \^x^ 

21: end if 

22 : end for 

23: return X 

24: end function 


At agglomeration steps t > 0, we can apply Theorem 4 with e' = ^ and e G [G] in order 

to limit the set of positions x and edges e for which the change in local energy A^f Es{x;S^; I) 
must be computed. In principle, we could apply Theorem 4 to all candidate actions e' G A^[G] at 
a given agglomeration step t, but this would require computing separate summed area tables for 
all components K G X{S^[C]) incident to a candidate action, which would involve considerable 
overhead. Therefore in practice the theorem is only applicable for t > 0. 

C.6 Zone-based pruning 

In the case of a pairwise shape descriptor specification s, we cannot apply Theorem 3, and conse¬ 
quently based only on Theorem 2, for each position x we must compute shape descriptors for all 
actions e G A^[Cs{x)]. Theorem 4 primarily allows us to prune positions x but not actions e, and is 
not applicable at the initial state t = 0. 

Ait = 0, the number of positions that must be considered within a given connectivity region G 
is exactly at later steps t > 0 the number of positions may be reduced due to Theorem 4 

but nonetheless tends to grow linearly with \Xlj\. The size of the active set A^[G] tends to grow 
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superlinearly in \C\. Hence, the computational cost of shape descriptor computation based only on 
the pruning theorems we’ve introduced thus far grows superquadratically in \C\. 

To mitigate this effect, we could of course simply ensure that connectivity regions are very small. 
A larger number of small connectivity regions does, however, introduce additional overhead, as 
explained in Appendix D, and therefore may actually increase the computational cost. Furthermore, 
reducing the connectivity region size also affects the extent to which shape descriptors reflect local 
or global connectivity, and we would like to be able to choose that independently of computational 
concerns. 

We therefore introduce a subdivision of connectivity regions into zones. 

Definition 18. For each connectivity region C e Cs, the zone set is a partition of X^. 

We can extend our definition of component visibility sets, previously defined only for individual 
positions in Definition 7, to sets of positions: 

Definition 19. The component visibility set Ws{Z; S) for a zone Z is defined by 

Definition 20. The zone visibility set W~^{K;C) is the set of zones whose component visibility set 
contains K : 

W-\K; C) := {Z G | K e Ws{Z; S)}, 
where S is some segmentation for which K G 1C{S[C]). 

Remark. The zone visibility set does not depend on the segmentation S beyond the fact that K G 
1C{S[C]). By definition, a merge that does not affect a connected component K' does not affect its 
zone visibility set C). 

Theorem 5. Given a supervoxel merge e of Ki^ K 2 in S[C], merging e in S has the effect of merging 
the zone visibility sets of Ki and K 2 : 

WpiKi U K2; C) = Wp{Ki-,C) U Wp{K 2 ; C). 

Proof To show that the Wf^{Ki;C) U Wf^{K 2 ;C) contains U K 2 ;C), let Z G 

Wf^{Ki U K 2 ]C) be given. Then G Z, c G {a, 6} G s such that K{x ^ c] S'[C]) = 
{Ki {J K 2 ), where S' is the segmentation that results from the merge of Ki and K 2 in S. Hence, 
K{x + c; S[C]) C 7 ^ 2 }, and it follows that Z G C) U Wf\K 2 ; C). 

To show that Wf^Ki U K 2 ;C) contains Wf^{Ki;C) U Wf\K 2 ;C), let Z G Wf\Ki;C) 
be given. Then G Z, c G {a, 6} G s such that K{x c; S[C]) = Ki. It follows that 
K{x + c; S'[C]) = {Ki U K 2 ). and therefore Z G Wf\Ki U K 2 ; C). □ 

Definition 21. A supervoxel merge e of Ki^K 2 in S[C] is said to be active in zone Z of S[C] if 
Z eWf\Ki;C)nWf\K2;C). 

Definition 22. We denote by the active action set of zone Z of connectivity region C at time 

t, the set of actions e in [C] that are active in zone Z of 5'^ [C]. 

Theorem 6 . If a supervoxel merge e in S[C] is not active in zone Z, then for all positions x G Z 
we have rs{x; S) = rs{x; S + e), AgEs{x; S; I) = 0. Furthermore, given a supervoxel merge e' in 
S[C], if a supervoxel merge e in {S + e')[C] is not active in zone Z, then for all positions x G Z, 

Af/Es{x-S-I) = ^. 

Proof. To prove the first statement, suppose the supervoxel merge e in S[C] of components AT, K' is 
not active in zone Z, and x G Z. Then by Definition 21, K'} g Vs{x] S). Hence, by Lemma 8 

rs{x] S) = rs{x] S E e), and the result follows from Theorem 1. 

To prove the second statement, suppose the supervoxel merge e of components Ki , K 2 in [S + e') [C] 
is not active in zone Z of (S' + e')[C']. By the first statement of the theorem, this implies that 
rs{x] S E e') = rs{x; S E e' E e) for all x G Z. It remains to be shown that e is also not active in 
zone Z of S[C]. By Definition 21, 

Z ^Wp{Kv,C)nWp{K2-,C). (1) 

There are two cases to consider: 
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1. If e is not incident to e' in S[C], then {Ki,K 2 } C IC{S[C]) and it follows from Eq. (1) and 
Definition 21 that e is also not active in zone Z of S[C]. 

2. Alternatively, if e is incident to e' in S[C], then without loss of generality we can assume 

that K 2 C IC{S[C]) and Ki = K[[J K 2 , where e' is a supervoxel merge of in S[C], 

and e is a supervoxel merge of K 2 in S[C]. Then by Theorem 5, 

W-\KuC) = W-HK[ U C) 

= w-Hk[-,c)dw-Hk[-,c). 

It follows that Z ^ WZ^{K[;C) D Wz^{K 2 ; C), which by Definition 21 implies that e is 
not active in zone Z of S[C]. By the first statement of the theorem, we have rs{x; S) = 
rs{x; S e) for all x G Z. The result follows from Theorem 1. □ 

If we ensure that the total number of zones, \Zs^c\A^ limited to a small constant, e.g. 64, then we 
can efficiently represent the zone visibility set C) for each component AT as a bit vector. 

Maintaining these visibility sets over the course of agglomeration, per Theorem 5, requires only 
bitwise disjunction operations; determining whether a supervoxel merge e is active in a zone Z, per 
Definition 21, requires only bitwise conjunction. 

Based on Theorem 6, the cost of computing all unpruned shape descriptors within a connectivity 
region C can be formulated as 

^ [{\Z\+a)-\A^^[C\\]+ms,c\), 

zeZs,c 

where a represents the overhead per action active in a zone, and /3 is a non-decreasing function 
that specifies an additional overhead for a given number of zones; a and [3 may represent either 
computational or memory costs. 

Minimizing this cost exactly is in general a hard integer programming problem. We find a locally- 
optimal solution using an approach that mirrors our approach for minimizing the global energy 
E{S; I): we start with an initial set of zones, either single-voxel zones or a regular grid, and greedily 
merging zones in order to reduce the cost. 

D Implementation 

A high-performance implementation of our agglomeration procedure is critical for testing and 
applying it to the large datasets inherent to neuronal reconstruction. The implementation challenges 
are, however, considerable: 

• Conceptually the local search over the space of agglomerations depends on the value of an 
enormous number of distinct local energy terms. 

• The pruning tricks described in Appendix C greatly reduce the number of shape descriptor 
and local energy model computations, but at the cost of significant algorithmic complexity. 

• We wish to be able to use a high-dimensional image feature representation (l){x; /). Storing 
the precomputed 512-dimensional image features over just as a small 256^ voxel volume 
in 32-bit floating point format requires 34 GB of memory. While in absolute terms this is 
not a large amount of memory, it limits the number of independent volumes that may be 
agglomerated in parallel on a single machine, and for reasonable cost-effectiveness it is 
necessary, therefore, that a single agglomeration be able to take advantage of multiple cores. 

• The computational steps required are not primarily standard operations like convolutions, 
Fourier transforms, matrix multiplications, or other linear algebra operations for which 
there has already been extensive study of efficient implementation techniques and for which 
high-performance implementations (for single and multiple CPU cores, as well as for GPU 
platforms) are already available. 

To address these challenges, we designed a parallel pipeline that, at agglomeration step t, determines 
which shape descriptors and local energy terms need to be computed, performs those computations, 
and updates I) for candidate actions e, in order that the action e that greedily minimizes 

E{S^ ^ e; I) may be chosen. 


24 


For each agglomeration step t 



Figure 9: High-level CELIS agglomeration procedure. Arrows show the flow of data (indicated by 
rectangles) and control (indicated by rounded rectangles). At a high-level, agglomeration proceeds 
in a sequential manner. At each agglomeration step t, the next action if selected to greedily 
minimize the global energy E{S^ -1- e^; /). If the best E' decreases the energy by more than r, i.e. 
A^gtE{S^; I) < r, then agglomeration continues. Otherwise, agglomeration terminates. To save 
computation at the cost of greater memory use, the image feature vector 0(x; /) for all positions x are 
precomputed prior to the start of agglomeration. The parallel pipeline used to initialize and update 
the action scores (steps 2 and 4) is shown in detail in Fig. 1 1; the details of the data structures that are 
updated by these steps are shown in Fig. 10. 


D.l Data structures maintained during agglomeration 

This pipeline is based around several interlinked data structures, as shown in Fig. 10: 

• The initial segmentation serves to define the agglomeration space over which our local 

search will operate. While conceptually we represent segmentations as an undirected 
graph over voxels (as described in Appendix A), we assume for simplicity that each initial 
supervoxel, i.e. connected component u G is a clique, as described in Appendix C.l. 

This allows us to unambiguously represent by labeling each voxel with an integer 

that uniquely identifies the supervoxel that contains it. Note that it is not in general the 
case that the connected components of at steps t > 0 are cliques, meaning that we 
cannot unambiguously represent by a component labeling. In fact we do not explicitly 
represent the segmentation at later steps; instead it is represented implicitly by the initial 
segmentation and the sequence of actions a^,..., that have been performed. 

• The global set of actions A^. As described in Appendix C.l, each action e e corresponds 
to a pair {u,v} C IC{S^), i.e. e = eu,v We represent each action eu,v by the pair of 
integer identifiers corresponding to the supervoxels u and v. The action set at any step 
t > 0 is simply A^ — {e^ \t' < t}. For each action e G we also maintain the set of 
connectivity regions C in which it is active, i.e. e G A^[(7]. This allows Theorem 2 to be 
applied efficiently. Recall that by Lemma 6, actions are removed from A^ [C] during the 
course of agglomeration, but are never added. Thus, once an action e is no longer active in 
any connectivity region, it ceases to affect the global energy. 

• The key information that the pipeline serves to maintain is the change in global energy, 
A^gtE{S^; /), that would result from merging each action e. This change in energy is 
essentially a score associated with the action. Our agglomeration procedure follows the 
greedy policy of choosing at each step the action with the lowest (i.e. most negative) score. 
Therefore, for each active set e we store the associated score, and we also maintain a priority 
queue over the scores, to allow for efficiently finding the edge with the lowest score. 
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Figure 10: Data structures for implementing CELIS agglomeration. Arrows indicate the links that 
make up the data structures. 

For each connectivity region C, we maintain a disjoint sets data structure that maps supervoxels 
u G /C(5'^) to connected components of Ff G 1C{S^[C]). For each connected component K, we 
maintain a list of incident supervoxel merge actions e G to allow for efficient application of 
Theorem 3. This represents the multigraph obtained by contracting the connected components 
of S^[C]. For each shape descriptor specification s for the connectivity region is used, we also 
maintain the zone information and zone visibility sets (represented as bit vectors) for each connected 
component K. 

For each action e G we store A^gtE{S^; /), which serves as the ordering key for a priority queue 
over actions used for greedy agglomeration. We also maintain for each action e the set of connectivity 
regions for which e G A^[C], for efficient application of Theorem 2. 


• Another major component is a data structure representing the set of connectivity regions, i.e. 
the union of the connectivity region tilings Cs for each shape descriptor specification s.^ The 
set of connectivity regions remains fixed throughout agglomeration. For each connectivity 
region, we maintain the following information: 

- A mapping from global supervoxels u G JC{S^) (represented by unique integer iden¬ 
tifiers) to connected components K G IC{S^[C]) within the connectivity region (also 
represented by unique integer identifiers within each connectivity region, separate 
from the global supervoxel identifiers). We handle the mapping of global supervoxel 
identifiers using a hash table, and we maintain the connected components using a 
standard disjoint sets data structure based on union by rank and path compression. [9, 
p. 505] 


^In the typical case that different tile sizes Bs and strides strides are used for each specification s, these 
tilings Cs will be disjoint (but certainly overlapping, as they cover the same space), meaning that each connectivity 
region C is associated with only one specification s. In general, though, there may be multiple shape descriptor 
specifications s for which C is used, i.e. C Sharing a connectivity region for multiple shape descriptor 
specifications slightly reduces memory and computational overhead, because the per-connectivity region data 
structures, namely the connected components 1C{K) and active action sets A^[C], only have to be stored and 
maintained once. 
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- The active set A^[C] of actions that affect connectivity within C. 

- For each component K G 1C{S^[C]), the set of incident actions e G A^[C]. Each 
incident action corresponds to a supervoxel merge of K and some other component 
K' G IC{S^[C]) in S^[C]. There may, however, be two distinct actions e, e' G A^[C] 
that are both supervoxel merges of the same two components K and K'. These 
sets of incident actions therefore correspond to the adjacency lists of the multigraph 
S^[C]/IC{S^[C]). 

- For each shape descriptor specification s for which C e Cs (typically there may only 
be one such s), we additionally maintain: 

* The partition Zs^c of represent each zone compactly as the union of 

disjoint rectangular regions. 

* For each component K G JC{S^[C]), the zone visibility set Wz^{K;C) repre¬ 
sented as a bit vector. 

• Because the image feature representation (j){x; I) is typically expensive to compute, and the 
same feature is used for computing Es{x; S; I) for many different candidate segmentations 
S, we precompute the image features for all positions x and store the feature vectors in a 
giant 4-D array. In practice the maximum volume size that can be agglomerated is limited 
by the available memory for storing the precomputed image feature array. 

D.2 Parallel pipeline for updating action scores 

The pipeline for updating action scores is shown in Fig. 11. The same overall flow of control and 
data is used both (a) to compute the initial E{S^; I) scores for all actions e e A^ prior to 
agglomeration, and (b) to incrementally update the AgtE{S^; I) scores from the prior agglomeration 

step by adding A^f E{S^; I) to reflect the agglomeration action chosen. At a high level, it consists 
of the following operations: 

• Steps 2-6: Preprocessing to determine the set of (x, e) position/action pairs for which 

we must compute shape descriptors rs{x; S^), rs{x; + e), and in the incremental case 
rs{x; E e^) and rs{x; 5'^ + + e). This preprocessing is where connectivity region- 

based pruning (Theorem 2), graph-based pruning (Theorem 3), visibility-based pruning 
(Theorem 4), and zone-based pruning (Theorem 6) applies. 

• Step 7: Computation of shape descriptors rs{x; rs{x; -h e), and in the incremental 

case rs{x; + e^) and rs{x; 5'^ + + e) for the necessary (x, e) position/action pairs. 

According to descriptor-based pruning (Theorem 1), we determine which local energy terms 
must be computed. 

• Steps 9-10: Computation of local energy terms needed to compute non-zero 
A^gtEs{x] S^; I) terms or, in the incremental case, non-zero A^f Es{x; S^; I) terms. 

• Steps 11-12: Updating the global action scores based on the local energy changes. 

The pipeline executes using all available processors on a single machine, through the use of a thread 
pool. The low-level details of the pipeline steps are as follows: 

1. (a) Before agglomeratioii/(b) Pick action E' to minimize E{S^ ^ I). 

2. Determine connectivity regions to update. In the non-incremental case, all connectivity 

regions C G UgCg must be processed. In the incremental case, per Theorem 2, only 
connectivity regions in C G {C G UgCg | G must be processed. Because we 

maintain this set of connectivity regions for each action e G there is only constant (low) 
overhead for each connectivity region processed, and no cost for connectivity regions not 
processed. 

3. Per-connectivity region processing: The connectivity regions that must be updated are 
processed in parallel. While most processing is actually done at the finer per-zone granularity, 
certain information is computed per-connectivity-region and per associated shape descriptor 
s \ C ^ Cs'. 

• Component label map: a 3-D array that maps positions in the space to compo¬ 
nents in ]C{S^ [C]), represented by integer identifiers. This is computed by mapping the 
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Figure 11: Pipeline for updating CELIS action AgtE{S^; I) scores. Arrows show the flow of 
data (indicated by rectangles) and control (indicated by rounded rectangles). The same pipeline 
is used both to compute the AgoE{S^; I) scores non-incrementally (starting at la) at the start of 
agglomeration, and to incrementally (starting at lb) update the A^gtE{S^; I) scores from the previous 

step by adding A^f E{S^; /). Dashed lines indicate steps and dependencies that apply only to the 
incremental case. Green or red lines indicate steps and dependencies that apply only to pairwise or 
center-based shape descriptor specifications s, respectively. To limit the complexity of the diagram, 
the dependencies on the persistent data structures shown in Fig. 10 are omitted. In the non-incremental 
case (la), the set of connectivity regions to update will be the full set UgCs and the set of merges 
to update (determined by step 2) will be the full active set A^[C]. In the incremental case (lb), the 
zone visibility sets are updated in step 4 per Theorem 5 to reflect the merge of K[ and prior 
to computing shape descriptors, to allow the conditions of Theorem 6 to be checked conveniently; 
the connected components (represented as disjoint sets of initial supervoxels IC{S^)), which affect 
the component label map ^ JC{S^[C]), are updated in step 13 only after computing shape 
descriptors, because the incremental update depends on computing shape descriptors rs{x; S^) and 
rs{x; 5'^+^) based on both the existing and next segmentation state. 
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Algorithm 2 Computation of a single shape descriptor. 

Require: s is a shape descriptor specification. 

Require: /C is a set of components, represented by integers. 
Require: F: JC maps shape descriptor offsets to components. 

1: function ComputeDescriptor(s, F) 

2: Declare | s \ -bit vector r 

3: if 5 is pairwise then 

4: for {a, b} e s do 

5: ^ lF(^a)=F(b) 

6 : end for 

7: else 

8: K ^ F(0) 

9: for {a, 0} G 5 do 

10: ^ tF(a)=K 

11: end for 

12 : end if 

13: return r 

14: end function 


Algorithm 3 Computation of shape descriptor changes (non-incremental case). The result is (a) 
a stream of position/shape descriptors pairs produced by calls to EmitDescriptor(x, r), which 
returns the stream position; (b) a separate stream of score adjustments produced by calls to 
EmitScoreAdjustment(m,that associate a merge m = {Ki,K 2 } with a negative and 
positive energy contribution corresponding to previously emitted shape descriptors at stream positions 
i~ and respectively. The computation of individual shape descriptors is shown in Algorithm 2. 

Require: A c is a set of positions. 

Require: L: X ^ 1C maps positions in X to components in 1C. 

Require: M: /C^2W maps components in 1C to sets of merges. 

1: function ComputeDescriptorChanges(s, A, L, M) 

2: Declare array 1C ^ 1C 

3: for A G A do 

4: 'ip^K) F- K > Initialize 'ip to the identity map. 

5: end for 

6: for X G A do 

7: r F- ComputeDescriptor(5, C ha L{x -h c)) 

8: i i -1 >—1 represents an invalid index 

9: for {Ki, K 2 } eM{L{x)) do 

10: V^(A2) ^ Ai 

11: Ve ^ ComputeDescriptor(5, c ha pj{L{x -h c))) 

12: 'ip{K 2 ) ^ K 2 > Restore pj to identity map. 

13: if r 7 ^ Te then 

14: if i = —1 then i F- EmitDescriptor(x, r) 

15: ie F- EmitDescriptor(x, Tq) 

16: EmitScoreAdjustment({Ai, A 2 }, i, A) 

17: end if 

18: end for 

19: end for 

20 : end function 
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Algorithm 4 Computation of shape descriptor changes (incremental case). 

Require: C /C is a merge. 


1: 

function ComputeDescriptorChangesIncremental(s, {K[,K 2 }, X, L, M) 

2: 

Declare arrays : 1C ^ 1C 


3: 

tor K eX do 


4: 

^ K > Initialize 

and to the identity map. 

5: 

end for 


6: 

^ K[ 


7: 

for X G X do 


8: 

r ComputeDescriptor(s, C ha l{x + c)) 


9: 

fe' ^ COMPUTEDESCRIPTOR(s, C ha (j)'{L{x + c))) 


10 

Iq' i — — 1 \> - 

-1 represents an invalid index 

11 

{or {Ki,K 2 } € M{L{x)) do 


12 

^{K2) ^ Ki 


13 

J[ ^ J' ^ i,'{K'2) 


14 

\{K 2 G {K[,K' 2 } then 


15 



16 

else 


17 



18 

end if 


19 

re ComputeDescriptor(s, C hA ^p{L{x + c))) 


20 

re,e' G- COMPUTEDESCRIPTOR(s, C hA 'tp' {L{x + c))) 

21 

^(i^2) ^ K2 

> Restore ^ to identity map. 

22 

^P'{Ki) ^ ij'{K2) ^ 4 

> Restore 7 /^' to initial value. 

23 

if Ve ^ re,e' then 


24 

if re' 7 ^ re,e' then 


25 

if ief = — 1 then 4' ^ EmitDescriptor(x, rg 

) 

26 

ig g/ ^ EmitDescriptor(x, Tg e /) 


27 

EmitScore Adjustment ({ Xi, X 2 }, ie',le,e') 


28 

end if 


29 

if r 7 ^ Tg then 


30 

if i = —1 then i ^ EmitDescriptor(x, r) 


31 

ie ^ EmitDescriptor(x, Tg) 


32 

EmitScore Adjustment ({ Xi, X 2 }, ig, i) 

> Note the order of ig and i. 

33 

end if 


34 

else if r / rg/ then 


35 

if i = —1 then i ^ EmitDescriptor(x, r) 


36 

if ief = —1 then ig/ G- EmitDescriptor(x, rg ) 


37 

EmitScoreAdjustment({Xi, X 2 }, ie ', i) 

\> Note the order of ig/ and i. 

38 

end if 


39 

end for 


40 

end for 


41 

end function 
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supervoxel identifier for each position x G which is precisely what is stored to rep¬ 
resent to the corresponding component based on the map from global supervoxels 
]C{S^) to connected components ]C{S^[C]) in C that we maintain. 

• K[,K 2 G 1C{S^[C]) merged by (incremental only): We also use the global su¬ 
pervoxel to local connected component map to translate the action to the pair of 
components K[^K 2 G IC{S^[C]) for which it is a supervoxel merge. Note that it is 
guaranteed that is a supervoxel merge in C because in the incremental case we only 
process connectivity regions C for which G A^[C]. 

• Visibility summed area table (incremental only): We compute a single summed 
area table for K[ U K 2 based on the component label map according to Algorithm 1. 

4. Determine the set of actions to update. In this step, for a given connectivity region, we 

determine the set of actions e G A^[C] for which me may potentially need to compute 
shape descriptors rs(x; S^), rs{x; -f e), and in the incremental case rs{x; e), 

according to Theorem 2. Note that these actions will additionally be filtered in step 6 on a 
per-zone basis. In the non-incremental case, and also in the incremental case for pairwise s, 
all actions e G A^[C] — {e^} must be (potentially) processed. In the incremental case for 
center-based s, only actions e ^ incident to in [C] must be processed, per Theorem 3. 

Because we maintain the set of actions incident to each component in JC{S^[C]), computing 
this set requires only constant time per action to be processed. 

Outputs: 

• Merges to update: the set Me of merges, i.e. pairs of components {Ki,K 2 } C 
IC{S^[C]) (represented as pairs of integer component identifiers) merged by the actions 
to be processed. Note that the same pair of components may correspond to more than 
one action e G but computation of shape descriptors depends only on the pair of 
components merged by the action. We therefore use the component representation to 
avoid redundant computations. 

• Merges to action map: a mapping from each component pair in Me to the set of one 
or more corresponding actions: 

{Xi,X2}g Me ^ 

\ui e Ki ^U2 e K 2 ]. 

This is implemented as a hash table mapping pairs of component identifiers to lists 
of actions. Because energy terms will be locally computed per component pair rather 
than per action, but globally we maintain per-action scores, this mapping is used to 
efficiently update all corresponding global per-action scores according to each local 
per-component-merge score. 

• Zone visibility sets (incremental only): The zone visibility sets, which are repre¬ 
sented as a mapping from integer component identifiers to bit vectors, are updated in 
this step per Theorem 5 to reflect the merge of K[ and in (5'^ + e^) [C]. This simply 
involves taking the bit-wise OR of the bit vectors. 

5. Per-zone processing: It is at the granularity of zones that shape descriptor computation 
actually happens. All zones are processed independently, and in parallel (zones of separate 
connectivity regions are also processed in parallel) to the extent that there are available 
cores. Zone processing does, however, depend on certain read-only data structures that 
are computed per-connectivity region and shared by all zones, including the component 
label map , the set of merges M^ to potentially update, and in the incremental case, the 
visibility summed area table. 

6. Determine set of merge/position pairs to update in zone Z. The purpose of this step 

is to finish preprocessing in order to finalize the set of (x, e) position/merge pairs for 
which we will compute shape descriptor changes. Per Theorem 6 and Definition 21, we 
filter the set of per-connectivity-region merges to update M^ based on the zone visibility 
sets Mf := {{Xi, X 2 } e M^ \ Z e C) H C)}, where in the non- 

incremental case AT* := K but in the incremental case AT* is the component AT G /C((5'^ + 
e^) [C]) that contains K. Note that in the implementation this happens transparently because 
the zone visibility bit vectors that are maintained for each component identifier are updated 
in the incremental case in step 4 to reflect 5'^+^ = + e^. We output either this flat merge 
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set directly or a table of merges incident to each component in {S^ e^)[C], depending on 

whether s is pairwise or center-based. 

Outputs: 


• Positions to update: In the non-incremental case, the set of positions to update 

is simply := Z. In the incremental case, we apply Algorithm 1 to the visibility 

summed area table precomputed in step 3 in order to determine the subset of positions 
X^ C Z that must be updated. The time complexity is linear in |Z|. To limit 
preprocessing overhead, we only use the first condition of Theorem 4 and do not test 
the more complicated second condition. 

• Merge set Mf (pairwise s only): In the case of a pairwise shape descriptor specifi¬ 
cation s, we can perform no further merge pruning, and must process all merges in 

Mf. 

• Component to merge set map (center-based s only): In the case of a center- 
based shape descriptor specification s, the subset of merges in that must be 
processed for a given position x depends on K{x; S^[C]) in the non-incremental 
case, or K{x; {S^ -h e^)[C]) in the incremental case. We therefore compute a table 

Ms : X{S^[C]) 2^^ that maps 

KeX{S^[C])^ 


{{K^,K2} e Mf I K* G 

where is defined as above. In the non-incremental case, each merge in will 
occur exactly twice in the table. In the incremental case, each merge will occur exactly 
3 times in the table, because every merge in is necessarily incident in S^[C] to 
(^ 1 ,^ 2 )- 

7. Compute shape descriptors: computation of shape descriptors rs{x; S^), rs{x; + e), 
and in the incremental case rs{x;S^ + e^) and rs{x;S^ + -1- e) for all (x,e) po¬ 

sition/merge pairs determined in step 6. To abstract the difference between pair¬ 
wise and center-based descriptors, in the case of pairwise s, we define Mg ‘ 
X{S^[C]) as the constant function K . In the non-incremental case, 

we invoke ComputeDescriptorChanges( 5, Lf, Aff ) defined in Algorithm 3. 

In the incremental case, we invoke ComputeDescriptorChangesIncremental 
( 5 , 7 ^ 2)5 ) defined in Algorithm 4. 

Outputs: 


• Action score adjustments: the list of {{Ki, K 2 },x,r ~tuples specifying updates 
to the global action scores, implicitly associated with a particular shape descriptor 
specification s. Each update in the list applies to the one or more global actions that 
are supervoxel merges of Ki and K 2 in S^[C], and corresponds to subtraction of 
Es (r~; (j){x] /)) and addition of Eg (r+; /)). Specifically, if we let U denote the 

aggregate set of all action score adjustments (s, {ATi, 7 ^ 2 }, , r+), then we have 


+ [Es (r+; Hx; /)) - 4 (r-; I)) 

{a,{Ki ,K2},x,r~ ,r'^)QU:ui^KiAu2EK2 


( 2 ) 

(3) 

(4) 


We represent the connected components Ki and K 2 by their corresponding integer 
identifiers. The same shape descriptors r~ and/or may occur in multiple action score 
adjustments, e.g. if they are equal to rg{x; S^) or rg{x; + e^). To avoid redundant 
storage in memory and redundant evaluation of the local energy model, we do not 
directly specify x,r~, and in our representation of the action score adjustments 
list. Instead, we specify r~ and as integer offsets i~ and into the list of shape 
descriptors and shape descriptor positions also output by this step. 

• Shape descriptors/Shape descriptor positions: equal length lists specifying the non- 
redundant shape descriptors/position pairs required by at least one action score adjust¬ 
ment. The lists are constructed in such a way that the (r, x) pairs are guaranteed to 
be unique. The entries are grouped by position x, meaning that if all (r, x) pairs for a 
given position x are contiguous. 
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8. Per-batch processing of shape descriptors: Evaluation of the local energy model on sin¬ 
gle shape descriptor/image feature pairs may be significantly more expensive than batch 
evaluation on multiple such pairs. For example, the matrix-vector multiplication required 
for typical fully-connected neural network activation can be much more efficiently im¬ 
plemented batch-wise as a matrix-matrix multiplication. We therefore collect the shape 
descriptor/position pairs output from step 7 into batches up to some maximum batch size, e.g. 
256. Because different local energy models are used for each shape descriptor specification 
s, batches are segregated by specification s. We 

9. Extract image features. We simply copy the image feature vectors (j){x] I) for each 
position X in the list of shape descriptor positions for the current batch from the in-memory 
precomputed image feature array. 

Output: 

• Image feature vectors: temporary array holding the copied image feature vectors 
contiguous in memory. 

10. Compute local energy. We evaluate in local energy terms Eg {r;v) for the current batch of 
shape descriptors r and image feature vectors v. 

Output: 

• Local energy terms: the list of local energy scores corresponding to the list of shape 
descriptors in the current batch. 

11. Update action scores. In this step, we update the global action scores according to Eq. (2), 
using the local energy terms computed in step 10 that are referenced by the action score 
adjustments computed in step 7. To determine the set of (global) actions that correspond to 
each pair of local connected components specified in the action score adjustments, we we 
use the merge to action map computed in step 4 for the connectivity region. 

12. Update action priority queue. After all updates to global action scores are complete, we 
must correct the ordering of the action priority queue. When performing the initial action 
score computation prior to agglomeration, we can simply construct the heap in linear time. 
In the incremental case, we correct the placement of just the action for which the score was 
updated. 

13. Update connected components (incremental only). In the incremental case, after com¬ 
puting the update action scores, we update within each affected connectivity region the 
disjoint sets data structure over supervoxels and the multigraph over connected components 
to reflect the merge e^. We do not perform this update until after updating the action scores 
because in step 7 we need to compute shape descriptors for the segmentation states and 

-h e, which would not be possible after merging . 

E Performance results 

We also measured the effectiveness of each of the computational pruning tricks described in Ap¬ 
pendix C. Essentially the entire computational cost of CELIS is in computing shape descriptors and 
evaluating the local energy models; the cost of performing the pruning and other preprocessing turns 
out to be negligible (less than 1%). Therefore the savings in descriptors processed correspond directly 
to savings in overall runtime. With pruning, computation of shape descriptors accounted for about 
20% of the cost; the remainder was spent evaluating the energy model. Without it, the cost is several 
orders of magnitude higher. The results are shown in Fig. 12. 


33 





Agglomeration step 


Agglomeration step 


Figure 12: Effect of pruning on number of shape descriptors computed. The vertical axis specifies the 
cumulative number of shape descriptors computed during the course of agglomeration, using different 
combinations of pruning rules. The horizontal axis specifies the agglomeration step t, with f = 0 
indicating the computation required to initialize the energy first derivative terms. Non-incremental 
corresponds a naive implementation that does no pruning or incremental computation whatsoever. 
The different combinations of CR, Visibility, Zone, and Graph correspond to correspond to applying 
combinations of Theorem 2, Theorem 4, Theorem 6, and Theorem 3, respectively. The actual number 
of descriptors that changed is shown as the lower bound, since in the best case pruning would 
eliminate the computation of all but these descriptors. This is also the number of evaluations of the 
energy model performed. If the combination of pruning techniques were perfect, it would exactly 
match this lower bound. Results are shown for a 100^ portion of the training dataset. 
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